Combining optimization for cache and instruction-level parallelism

Author

Steve Carr

Abstract

Current architectural trends in instruction-level parallelism (ILP) have significantly increased the computational power of microprocessors. As a result, the demands on the memory system have increased dramatically. Not only do compilers need to be concerned with finding ILP to utilize machine resources effectively, but they also need to be concerned with ensuring that the resulting code has a high degree of cache locality. Previous work has concentrated either on improving ILP in nested loops [3, 6, 7, 14, 16, 17] or on improving cache performance [9, 15, 18]. This paper presents a performance metric that can be used to guide the optimization of nested loops considering the combined effects of ILP, data reuse and latency hiding techniques. We have implemented the technique in a source-to-source transformation system called Memoria [5]. Preliminary experiments reveal that dramatic performance improvements for nested loops are obtainable (we regularly get at least a factor of 2 on kernels run on two different architectures).

Similar resources

An efficient memory operations optimization technique for vector loops on Itanium 2 processors

To keep up with a large degree of instruction level parallelism (ILP), the Itanium 2 cache systems use a complex organization scheme: load/store queues, banking and interleaving. In this paper, we study the impact of these cache systems on memory instructions scheduling. We demonstrate that, if no care is taken at compile time, the non-precise memory disambiguation mechanism and the banking str...

Architectural Analysis and Instruction-Set Optimization for Design of Network Protocol Processors

TCP/IP protocol processing latency has been an important issue in high-speed networks. In this paper, we present an architectural study of TCP/IP protocol. We port the TCP/IP protocol stack from the 4.4 FreeBSD to the SimpleScalar simulation environment. The architectural characteristics, such as instruction level parallelism and cache behavior, are studied through simulation. We also compare t...

Practical Precise Evaluation of Cache Effects on Low Level Embedded VLIW Computing

The introduction of caches inside high performance processors provides technical ways to reduce the memory gap by tolerating long memory access delays. While such intermediate fast caches accelerate program execution in general, they have a negative impact on the predictability of program performances. This lack of performance stability is a non-desirable characteristic for embedded computing. W...

On the Nature of Cache Miss Behavior: Is It √2?

It has long been empirically observed that the cache miss rate decreased as a power law of cache size, where the power was approximately -1/2. In this paper, we examine the dependence of the cache miss rate on cache size both theoretically and through simulation. By combining the observed time dependence of the cache reference pattern with a statistical treatment of cache entry replacement, we ...

Design and Implementation of a Lightweight Dynamic Optimization System

Many opportunities exist to improve micro-architectural performance due to performance events that are difficult to optimize at static compile time. Cache misses and branch mis-prediction patterns may vary for different micro-architectures using different inputs. Dynamic optimization provides an approach to address these and other performance events at runtime. This paper describes a software s...

Publication date: 1996
